LDA and QDA: Handwritten Digits

Let’s continue with the digits data. We read-in the data:

url <- "https://raw.githubusercontent.com/datasciencelabs/data/master/hand-written-digits-train.csv"

#if(!exists("digits")) digits <- read_csv(url)
digits <- read_csv(url)

To simplify the problem we will try to distinguish 2s from 7s. So we subset to only those digits.

dat <- digits %>% filter(label %in% c(2,7))

For illustrative purposes we created two features: X_1 is the percent of non-white pixels that are in the top left quadrant and X_2 is the percent of non-white pixels that are in the bottom right quadrant:

We can create these new predictors like we did before:

dat <- mutate(dat, label =  as.character(label)) %>% 
       mutate(y = ifelse(label == "2",0,1 ))
row_column <- expand.grid(row=1:28, col=1:28)
ind1 <- which(row_column$col <= 14 & row_column$row <= 14)
ind2 <- which(row_column$col > 14  & row_column$row >  14)
ind <- c(ind1,ind2)
X <- as.matrix(dat[,-1])
X <- X > 200
X1 <- rowSums(X[,ind1])/rowSums(X)
X2 <- rowSums(X[,ind2])/rowSums(X)
dat <- mutate(dat, X_1 = X1, X_2 = X2)

We can see some examples of what these predictors (pixels) are:

We act as if we know the truth (the boundary that we should use to distinguish 2s from 7s):

Quadratic and Linear Discriminant Analysis

For illustration purposes let’s take a subset of 1,000 handwritten digits:

set.seed(1)
dat <- sample_n(dat, 1000) %>% dplyr::select(y, X_1, X_2)
head(dat)
## # A tibble: 6 × 3
##       y    X_1   X_2
##   <dbl>  <dbl> <dbl>
## 1     1 0.238  0.417
## 2     0 0.0610 0.293
## 3     0 0.151  0.315
## 4     0 0.0957 0.255
## 5     0 0.189  0.256
## 6     0 0      0.234

Now create train and test sets:

inTrain   <- createDataPartition(y = dat$y, p = 0.5, times = 1, list = FALSE)
train_set <- slice(dat, inTrain)
test_set  <- slice(dat, -inTrain)

Quadratic Discriminant Analysis (QDA) relates to the Naive Bayes approach we described earlier. We try to estimate \(\mbox{Pr}(Y=1|X=x)\) using Bayes theorem.

\[ p(x) = \mbox{Pr}(Y=1|\mathbf{X}=\mathbf{x}) = \frac{\pi f_{\mathbf{X}|Y=1}(x)} {(1-\pi) f_{\mathbf{X}|Y=0}(x) + \pi f_{\mathbf{X}|Y=1}(x)} \]

With QDA, we assume that each predictor is normally distributed and allow for the decision boundary to be curved (quadratic). Here, we have two predictors, so we need to estimate two averages, two standard deviations, and a correlation for each case: \(Y=1\) and \(Y=0\):

#options(digits = 2)
params <- train_set %>% group_by(y) %>% 
          summarize(avg_1 = mean(X_1), avg_2 = mean(X_2), sd_1= sd(X_1), sd_2 = sd(X_2), r = cor(X_1,X_2))
params
## # A tibble: 2 × 6
##       y avg_1 avg_2   sd_1   sd_2     r
##   <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>
## 1     0 0.130 0.286 0.0674 0.0592 0.436
## 2     1 0.232 0.292 0.0750 0.107  0.360

We can use these parameter estimates to calculate the QDA estimates by hand, but we can also perform QDA automatically using the qda function from the MASS package.

qda_fit <- qda(y ~ X_1 + X_2, data = train_set)
qda_preds <- predict(qda_fit, test_set)

This defines the following estimate of \(f(x_1, x_2)\)

This fits the “truth” extremely well. Let’s see how well we perform on the test set. Note that the class prediction (0 or 1) can be accessed using $class after using the predict function to make predictions and the class probabilites can be accessed using $posterior. We get an overall accuracy of 0.85 and balanced sensitivity and specificity.

confusionMatrix(data = as.factor(qda_preds$class), reference = as.factor(test_set$y))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 219  35
##          1  40 206
##                                           
##                Accuracy : 0.85            
##                  95% CI : (0.8156, 0.8802)
##     No Information Rate : 0.518           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6998          
##                                           
##  Mcnemar's Test P-Value : 0.6442          
##                                           
##             Sensitivity : 0.8456          
##             Specificity : 0.8548          
##          Pos Pred Value : 0.8622          
##          Neg Pred Value : 0.8374          
##              Prevalence : 0.5180          
##          Detection Rate : 0.4380          
##    Detection Prevalence : 0.5080          
##       Balanced Accuracy : 0.8502          
##                                           
##        'Positive' Class : 0               
## 

Here we have 2 predictors and had to compute 4 means, 4 SDs and 2 correlations. How many parameters would we have if instead of 2 predictors we had 10?

The main problem comes from estimating correlations for 10 predictors. With 10, we have 45 correlations for each class. In general the formula is \(p(p-1)/2\) which gets big quickly.

A solution to this is to assume that the correlation structure is the same for all classes. This reduces the number of parameters we need to estimate. When we do this, we can show mathematically that the solution is “linear”, in the linear algebra sense, and we call it Linear Discriminant Analysis (LDA).

params <- train_set %>% group_by(y) %>% 
  summarize(avg_1 = mean(X_1), 
            avg_2 = mean(X_2), 
            sd_1= sd(X_1), 
            sd_2 = sd(X_2), 
            r = cor(X_1,X_2))

params <- params %>% mutate(sd_1 = mean(sd_1), 
                            sd_2=mean(sd_1), 
                            r=mean(r))
params 
## # A tibble: 2 × 6
##       y avg_1 avg_2   sd_1   sd_2     r
##   <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>
## 1     0 0.130 0.286 0.0712 0.0712 0.398
## 2     1 0.232 0.292 0.0712 0.0712 0.398

With this simplification we gain computation speed and feasibility, but lose accuracy. We can fit an LDA model automatically using the lda function, similar to the qda function. We see a dip in overall accuracy to 0.816.

lda_fit <- lda(y ~ X_1 + X_2, data = train_set)
lda_preds <- predict(lda_fit, test_set)
confusionMatrix(data = as.factor(lda_preds$class), reference = as.factor(test_set$y))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 207  40
##          1  52 201
##                                          
##                Accuracy : 0.816          
##                  95% CI : (0.7792, 0.849)
##     No Information Rate : 0.518          
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.6322         
##                                          
##  Mcnemar's Test P-Value : 0.2515         
##                                          
##             Sensitivity : 0.7992         
##             Specificity : 0.8340         
##          Pos Pred Value : 0.8381         
##          Neg Pred Value : 0.7945         
##              Prevalence : 0.5180         
##          Detection Rate : 0.4140         
##    Detection Prevalence : 0.4940         
##       Balanced Accuracy : 0.8166         
##                                          
##        'Positive' Class : 0              
## 

This defines the following estimate of \(f(x_1, x_2)\) and the boundary becomes linear:

ROC

With the example we have been examining we can make two types of errors: calling a 2 a 7 or calling a 7 a 2. More generally, with binary data we call these false positives (calling a 0 a 1) and false negatives (calling a 1 a 0). Here we have arbitrarily made 7s 1s and 2s 0s.

This concept is important in many areas and in particular in health where one type of mistake can be much more costly than another. Note that we have been predicting 1s based on the rule \(\hat{f}(x_1, x_2) > 0.5\) but we can pick another cutoff, depending on the cost of each type of error. For example, if we are predicting if a plane will malfunction, then we want a very low false negative rate and are willing to sacrifice our true positive rate.

We can see that the estimated probabilities are on a continuum:

pred_qda <- qda_preds$posterior[,2]
test_set %>% mutate(pred=pred_qda, label=as.factor(y)) %>%
  ggplot(aes(label,pred)) + geom_boxplot()

The Receiver Operator Characteristic Curve (ROC Curve) plots true positive rate versus false positive rate for several choices of cutoff. We can create this curve using base R with the following code:

library(pROC)

roc_qda <- roc(test_set$y, pred_qda)
plot(roc_qda)

We can also create a nicer looking plot using ggroc:

ggroc(list("QDA" = roc_qda)) +
  theme(legend.title = element_blank()) +
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "black", linetype = "dashed") +
  xlab("Specificity") +
  ylab("Sensitivity") 

Here are the results for LDA

pred_lda <- lda_preds$posterior[,2]
roc_lda <- roc(test_set$y, pred_lda)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ggroc(list("QDA" = roc_qda, "LDA" = roc_lda)) +
  theme(legend.title = element_blank()) +
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "black", linetype = "dashed") +
  xlab("Specificity") +
  ylab("Sensitivity") 

We can also compare to KNN with two different values of k: 5 and 51.

fit <- knn3(y~., data = train_set, k = 5)

pred_knn_5 <- predict(fit, newdata = test_set)[,2]
roc_knn_5 <- roc(test_set$y, pred_knn_5)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
fit <- knn3(y~., data = train_set, k = 51)
pred_knn_51 <- predict(fit, newdata = test_set)[,2]
roc_knn_51  <- roc(test_set$y, pred_knn_51)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ggroc(list("QDA" = roc_qda, "LDA" = roc_lda, "kNN, k = 5" = roc_knn_5, "kNN, k = 51" = roc_knn_51)) +
  theme(legend.title = element_blank()) +
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "black", linetype = "dashed") +
  xlab("Specificity") +
  ylab("Sensitivity")

To summarize these curves into one single number that can be compared across methods, it is common to take the area under the ROC curve (abbreviated AUC or AUROC). Higher values indicate better performance.

auc(roc_qda)
## Area under the curve: 0.9226
auc(roc_lda)
## Area under the curve: 0.8934
auc(roc_knn_5)
## Area under the curve: 0.8822
auc(roc_knn_51)
## Area under the curve: 0.9171

Three classes

Here we look at an example where there are three classes to consider. In addition to 2s and 7s, we also consider 1s. Now we need to estimate three different curves that represent the conditional probabilities of being each digit given the values of \(X_1\) and \(X_2\).

We’ll create a training set and test set so we can test out how each method does on the three class problem.

set.seed(1)
dat <- sample_n(dat, 3000) %>% dplyr::select(label, X_1, X_2) %>% dplyr::mutate(label = as.factor(label))
inTrain   <- createDataPartition(y = dat$label, p = 0.5, times = 1, list = FALSE)
train_set <- slice(dat, inTrain)
test_set  <- slice(dat, -inTrain)
train_set %>% ggplot(aes(X_1,X_2,fill=label)) + 
  geom_point(cex=3, pch=21)

LDA on three classes

This defines the following estimate of \(f(x_1, x_2)\) and the boundary becomes linear:

fit_lda = lda(label ~ X_1 + X_2, data = train_set)

QDA on three classes

This defines the following estimate of \(f(x_1, x_2)\) and the boundary becomes non-linear:

fit_qda = qda(label ~ X_1 + X_2, data = train_set)

GLM on three classes

Let’s compare the performance of LDA and QDA with logistic regression. For each label, we will train a model to predict that number, or not that number. So fit1 predicts 1 vs not 1, fit2 predicts 2 vs not 2, and fit7 predicts 7 vs not 7. Then we can also plot these boundaries.

fit1 <- glm(y~X_1+X_2, data=mutate(train_set,
                                   y=label=="1"),family="binomial")
fit2 <- glm(y~X_1+X_2, data=mutate(train_set,
                                   y=label=="2"),family="binomial")
fit7 <- glm(y~X_1+X_2, data=mutate(train_set,
                                   y=label=="7"),family="binomial")

kNN on three classes

Let’s also use kNN with k = 51 and k = 101 and draw the boundaries.

fit_51 <- knn3(label ~ ., data = train_set, k = 51)

fit_101 <- knn3(label ~ ., data=train_set, k = 101)

Comparison

After training each model we can predict labels for the test set and then compare the accuracy of each method.

##QDA
pred_qda <- predict(fit_qda, test_set)

##LDA
pred_lda <- predict(fit_lda, test_set)

##GLM
f_hat1 <- predict(fit1, newdata = test_set, type = "response")
f_hat2 <- predict(fit2, newdata = test_set, type = "response")
f_hat7 <- predict(fit7, newdata = test_set, type = "response")

pred_glm <- apply(cbind(f_hat1, f_hat2, f_hat7), 1, which.max)

##KNN 51
f_hat <- predict(fit_51, newdata = test_set)
pred_knn_51 <- apply(f_hat, 1, which.max)

##KNN 101
f_hat <- predict(fit_101, newdata = test_set)
pred_knn_101 <- apply(f_hat, 1, which.max)

Let’s compare:

tab <- table(factor(pred_lda$class, labels=c("1","2","7")), test_set$label)
confusionMatrix(tab)
## Confusion Matrix and Statistics
## 
##    
##       1   2   7
##   1 403 202  28
##   2  50 151  75
##   7  78 109 403
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6384          
##                  95% CI : (0.6135, 0.6628)
##     No Information Rate : 0.3542          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4528          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 7
## Sensitivity            0.7589   0.3268   0.7964
## Specificity            0.7624   0.8795   0.8117
## Pos Pred Value         0.6367   0.5471   0.6831
## Neg Pred Value         0.8522   0.7457   0.8867
## Prevalence             0.3542   0.3082   0.3376
## Detection Rate         0.2688   0.1007   0.2688
## Detection Prevalence   0.4223   0.1841   0.3936
## Balanced Accuracy      0.7607   0.6031   0.8041
confusionMatrix(tab)$overall[1]
##  Accuracy 
## 0.6384256
tab <- table(factor(pred_qda$class, labels=c("1","2","7")), test_set$label)
confusionMatrix(tab)$overall[1]
##  Accuracy 
## 0.7431621
tab <- table(factor(pred_glm, labels=c("1","2","7")), test_set$label)
confusionMatrix(tab)$overall[1]
##  Accuracy 
## 0.6124083
tab <- table(factor(pred_knn_51, labels=c("1","2","7")), test_set$label)
confusionMatrix(tab)$overall[1]
##  Accuracy 
## 0.7464977
tab <- table(factor(pred_knn_101, labels=c("1","2","7")), test_set$label)
confusionMatrix(tab)$overall[1]
##  Accuracy 
## 0.7444963